Mastering Data Visualization with R

Author

Martin Schweinberger

Published

January 1, 2026

Introduction

This tutorial introduces data visualisation with R, focusing on the ggplot2 package. It covers a wide range of plot types suited to different data structures and research questions — from scatter plots and distribution plots to Likert scale visualisations, heatmaps, time series, and publication-ready figures. Throughout, the emphasis is on choosing the right visualisation for a given question, understanding the grammar of graphics that underlies ggplot2, and developing the habits that lead to clear, reproducible, and honest data communication.

The tutorial works through a concrete dataset on preposition frequencies in historical English texts, providing a continuous research narrative that connects the individual examples. Exercises at the end of each section consolidate understanding.

Learning Objectives

By the end of this tutorial you will be able to:

  1. Explain the grammar of graphics and how it structures ggplot2 code
  2. Choose an appropriate visualisation type for a given data structure and research question
  3. Create scatter plots, density plots, histograms, ridge plots, boxplots, violin plots, bar plots, heatmaps, line graphs, and ribbon plots in ggplot2
  4. Visualise Likert scale survey data using grouped bar plots and gglikert
  5. Customise plots with themes, colour palettes, labels, and annotations
  6. Apply accessibility principles including redundant encoding and colourblind-safe palettes
  7. Combine multiple plots into a single figure using patchwork
  8. Save publication-quality figures in appropriate formats and resolutions
  9. Avoid common visualisation mistakes including truncated axes, chartjunk, and overplotting
Prerequisite Tutorials

Before working through this tutorial, you should be familiar with:

Citation

Martin Schweinberger. 2026. Mastering Data Visualization with R. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/dviz/dviz.html (Version 2026.05.01).


Setup and Preparation

Section Overview

What you will learn: Which packages are needed and why; how to load the tutorial dataset; and how to set up a consistent colour palette for use throughout the tutorial

Installing required packages

Run this code once to install all required packages. It may take a few minutes.

Code
install.packages("dplyr")
install.packages("stringr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("scales")
install.packages("ggridges")
install.packages("ggstats")
install.packages("ggstatsplot")
install.packages("EnvStats")
install.packages("likert")
install.packages("vcd")
install.packages("hexbin")
install.packages("patchwork")    # Combining multiple plots
install.packages("viridis")      # Colourblind-safe palettes
install.packages("flextable")
install.packages("devtools")

# Install ggflags from GitHub (country flags in plots)
devtools::install_github("jimjam-slam/ggflags")

Loading packages

Code
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(flextable)
library(hexbin)
library(patchwork)
library(ggflags)
library(ggstats)
library(ggridges)
library(EnvStats)
library(scales)
library(viridis)

Loading and inspecting the data

We work throughout this tutorial with a dataset on preposition frequencies in historical English texts from the Penn Parsed Corpora of Historical English (PPCME, PPCEME, PPCMBE). Each row represents one text, and the key variables are described below.

Code
pdat <- base::readRDS("tutorials/dviz/data/pvd.rda", "rb")

Date

Genre

Text

Prepositions

Region

GenreRedux

DateRedux

1,736

Science

albin

166.01

North

NonFiction

1700-1799

1,711

Education

anon

139.86

North

NonFiction

1700-1799

1,808

PrivateLetter

austen

130.78

North

Conversational

1800-1913

1,878

Education

bain

151.29

North

NonFiction

1800-1913

1,743

Education

barclay

145.72

North

NonFiction

1700-1799

1,908

Education

benson

120.77

North

NonFiction

1800-1913

1,906

Diary

benson

119.17

North

Conversational

1800-1913

1,897

Philosophy

boethja

132.96

North

NonFiction

1800-1913

1,785

Philosophy

boethri

130.49

North

NonFiction

1700-1799

1,776

Diary

boswell

135.94

North

Conversational

1700-1799

1,905

Travel

bradley

154.20

North

NonFiction

1800-1913

1,711

Education

brightland

149.14

North

NonFiction

1700-1799

1,762

Sermon

burton

159.71

North

Religious

1700-1799

1,726

Sermon

butler

157.49

North

Religious

1700-1799

1,835

PrivateLetter

carlyle

124.16

North

Conversational

1800-1913

Variable descriptions:

  • Date — year the text was written (continuous)
  • Genre — text genre (Fiction, Legal, Religious, etc.)
  • Text — source text identifier
  • Prepositions — relative frequency of prepositions per 1,000 words
  • Region — geographic origin of the text (North/South)
  • GenreRedux — simplified genre categories (5 levels)
  • DateRedux — time period categories (1150–1499, 1500–1599, etc.)

Setting up a colour palette

Using a consistent colour palette across all visualisations creates a coherent, professional look and reduces the cognitive load of switching between colour schemes. We define five colours here that we will reuse throughout.

Code
clrs <- c("purple", "gray80", "lightblue", "orange", "gray30")
Colour resources
  • R Color Reference — all named colours in R
  • ColorBrewer — palettes designed for maps and data visualisation, many colourblind-safe
  • Viridis — perceptually uniform, colourblind-safe palettes

For accessibility, prefer palettes from the viridis package or scale_color_brewer() with "Set2" or "Dark2".


Part 1: The Grammar of Graphics

Section Overview

What you will learn: The conceptual framework underlying ggplot2; the seven components of every plot; and how to read and write ggplot2 code systematically

Why ggplot2?

ggplot2 is the dominant data visualisation package in R for good reason. It is based on a coherent theoretical framework — the grammar of graphics — that makes it possible to construct any plot from a small set of building blocks. Rather than memorising individual plot functions, you learn a system: once you understand the grammar, you can build plots you have never seen before by composing components in new ways.

The grammar of graphics, formalised by Wilkinson (2005) and implemented in ggplot2 by Wickham (2010), describes a plot as the result of mapping data to aesthetics through geometric objects, with additional components controlling scales, coordinate systems, facets, and themes.

The seven components

Every ggplot2 plot is built from up to seven components:

1. Data — the data frame containing the variables to be visualised. Passed as the first argument to ggplot().

2. Aesthetics (aes()) — the mapping from data variables to visual properties: which variable goes on the x-axis, which on the y-axis, which controls colour, size, shape, transparency, and so on. Aesthetics defined inside ggplot() apply to all layers; aesthetics inside a specific geom_*() apply only to that layer.

3. Geometries (geom_*()) — the geometric objects used to represent the data. Points, lines, bars, boxes, ribbons, tiles, and text are all geometries. Each geom_*() call adds a new layer to the plot.

4. Scales (scale_*()) — control how aesthetic mappings are translated into visual properties. For example, scale_color_manual() specifies exact colours; scale_x_log10() log-transforms the x-axis; scale_y_continuous(labels = scales::percent) formats y-axis labels as percentages.

5. Facets (facet_wrap(), facet_grid()) — split the data into subplots by the values of one or more categorical variables. Faceting is one of the most powerful features of ggplot2 for comparing patterns across groups.

6. Coordinate system (coord_*()) — controls the space in which the plot is drawn. coord_flip() swaps x and y; coord_polar() creates polar (circular) coordinates; coord_cartesian() sets axis limits without dropping data points.

7. Theme (theme_*(), theme()) — controls all non-data visual elements: background colour, gridlines, font sizes, axis tick marks, legend position, and so on. theme_bw() and theme_minimal() are good defaults for publication work.

The ggplot2 template

Every ggplot2 call follows this template:

Code
ggplot(data = <DATA>, aes(x = <X>, y = <Y>, color = <GROUP>)) +
  geom_<TYPE>(<PARAMETERS>) +
  scale_<AESTHETIC>_<TYPE>(<PARAMETERS>) +
  facet_<TYPE>(vars(<VARIABLE>)) +
  coord_<TYPE>() +
  theme_<STYLE>() +
  labs(title = "<TITLE>", x = "<X LABEL>", y = "<Y LABEL>")

The + operator adds layers and components to the plot. The order generally does not matter for the final result, but it is conventional to put data layers first, then scales, then facets, then theme, then labels.

Reading existing ggplot2 code

When you encounter unfamiliar ggplot2 code, read it layer by layer. Ask: what data is being used? What is mapped to x, y, colour, and other aesthetics? What geometric objects are being drawn? What scales and themes have been applied? This decomposition makes even complex plots understandable.






Part 2: Exploring Relationships

Section Overview

What you will learn: Scatter plots as the foundation for showing relationships between two continuous variables; adding colour, shape, and trend lines; using facets; managing overplotting with transparency, density contours, and hex plots

Scatter plots

Scatter plots are the most direct way to visualise the relationship between two continuous variables. Each point represents one observation.

When to use: Two continuous variables; sample size small enough that individual points can be seen (roughly < 5,000 without overplotting strategies).

Basic scatter plot

Code
ggplot(data = pdat,
       aes(x = Date,
           y = Prepositions)) +
  geom_point() +
  theme_bw() +
  labs(x = "Year",
       y = "Prepositions per 1,000 words")

Reading the code
  • ggplot() initialises the plot and sets the default data and aesthetics
  • aes(x = Date, y = Prepositions) maps the variable Date to the x-axis and Prepositions to the y-axis
  • geom_point() adds a layer of points — one per row in the data
  • theme_bw() applies a clean black-and-white theme
  • labs() sets axis labels

Adding colour and shape

Using both colour and shape to encode the same variable is called redundant encoding. It makes plots more accessible: readers who cannot distinguish colours (about 8% of men have some form of colour vision deficiency) can still use the shapes, and the plot retains its meaning when printed in greyscale.

Code
ggplot(pdat,
       aes(Date, Prepositions,
           color = GenreRedux,
           shape = GenreRedux)) +
  geom_point(size = 2) +
  scale_shape_manual(name = "Genre", values = 1:5) +
  scale_color_manual(name = "Genre", values = clrs) +
  theme_bw() +
  theme(legend.position = "top") +
  labs(x = "Year", y = "Prepositions per 1,000 words")

Faceted scatter plots with trend lines

When points from multiple groups overlap, faceting into separate panels makes individual group patterns visible. Adding a trend line with geom_smooth() makes the overall direction of change within each group explicit.

Code
ggplot(pdat, aes(Date, Prepositions, color = Genre)) +
  facet_wrap(vars(Genre), ncol = 4) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.8) +
  theme_bw() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(size = 8, angle = 90)
  ) +
  labs(x = "Year", y = "Prepositions per 1,000 words")

Facets: when to use them

Facets work best when you have 3–8 groups whose within-group patterns are the focus, and when direct across-group value comparison is less important than seeing each group’s trend clearly. Avoid facets when groups need to be directly overlaid for comparison, or when you have more than about 10 groups.

Managing overplotting

When many points occupy the same region, individual points become invisible. Three strategies address this:

Transparency (alpha) — making points semi-transparent so density is visible as colour intensity.

2D density contours (geom_density_2d) — contour lines showing where data is concentrated, like a topographic map.

Hex plots (geom_hex) — the plotting region is divided into hexagonal bins; each bin is coloured by the number of points it contains. Effective for very large datasets.

Code
ggplot(pdat, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  facet_wrap(vars(GenreRedux), ncol = 5) +
  geom_density_2d() +
  theme_bw() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(size = 8, angle = 90)
  ) +
  labs(x = "Year", y = "Prepositions per 1,000 words")

Code
pdat |>
  ggplot(aes(x = Date, y = Prepositions)) +
  geom_hex() +
  scale_fill_gradient(low = "lightblue", high = "darkblue",
                      name = "Count") +
  theme_bw() +
  labs(x = "Year", y = "Prepositions per 1,000 words",
       title = "Hex plot: point density")

Approach Best for Limitation
Points Small–medium datasets, seeing all data Gets cluttered with many points
Transparency Moderate overplotting Still unclear at very high density
Density contours Showing concentration patterns Harder to interpret than points
Hex bins Very large datasets Requires comparable x–y scales

Part 3: Showing Distributions

Section Overview

What you will learn: Density plots, histograms, ridge plots, boxplots, and violin plots — when each is appropriate and what each reveals that the others do not

Density plots

Density plots show the estimated probability density of a continuous variable as a smooth curve. They are particularly useful for comparing the shape of a distribution across groups.

Code
ggplot(pdat, aes(Date, fill = Region)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = clrs[1:2]) +
  theme_bw() +
  theme(legend.position = c(0.1, 0.9)) +
  labs(x = "Year", y = "Density",
       title = "Temporal distribution of texts by region")

The plot shows that southern texts continue into the 1800s while northern texts end around 1700, with a period of overlap in between.

Histograms

Histograms divide a continuous variable into equal-width bins and count how many observations fall in each. Unlike density plots, they show actual counts and make the discretisation of the data explicit.

Code
ggplot(pdat, aes(Prepositions)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme_bw() +
  labs(title = "Distribution of preposition frequencies",
       x = "Prepositions per 1,000 words",
       y = "Count")

Histogram vs. bar plot

A histogram shows the distribution of one continuous variable. The bins are ranges of values, and there are no gaps between bars (the variable is continuous).

A bar plot shows counts or values for discrete categories. Bars are separated by gaps to reflect the categorical (not continuous) nature of the x-axis.

Confusing the two is one of the most common plotting mistakes in student work.

Ridge plots

Ridge plots (also called joy plots) show offset density curves for multiple groups, making it easy to compare shapes across many groups simultaneously. They are particularly effective when you have more groups than can comfortably be shown in overlapping densities.

Code
pdat |>
  ggplot(aes(x = Prepositions, y = GenreRedux, fill = GenreRedux)) +
  geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none") +
  labs(y = "", x = "Relative frequency of prepositions per 1,000 words",
       title = "Preposition frequency distributions by genre")

Boxplots

Boxplots display five summary statistics simultaneously: the median (line inside the box), the first and third quartiles (the box edges, enclosing the interquartile range, IQR), and the whiskers extending to 1.5 times the IQR beyond each box edge. Points beyond the whiskers are plotted individually as potential outliers.

Code
ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux)) +
  geom_boxplot() +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Time period", y = "Prepositions per 1,000 words")

Notched boxplots

Adding notch = TRUE draws notches around the median. If notches of two boxes do not overlap, there is strong visual evidence that the medians differ significantly. This is a useful quick check, though it is not a substitute for formal statistical testing.

Code
ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux)) +
  geom_boxplot(notch = TRUE,
               outlier.colour = "red",
               outlier.shape = 2,
               outlier.size = 3) +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Time period", y = "Prepositions per 1,000 words",
       title = "Notched boxplots: overlapping notches suggest similar medians")

Enhanced boxplots with jittered points

Overlaying the individual data points on the boxplot reveals the sample size and distribution simultaneously.

Code
ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux, color = DateRedux)) +
  geom_boxplot(varwidth = TRUE, color = "black", alpha = 0.3) +
  geom_jitter(alpha = 0.3, height = 0, width = 0.2) +
  facet_grid(~Region) +
  EnvStats::stat_n_text(y.pos = 65) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "", y = "Frequency per 1,000 words",
       title = "Preposition use across time and regions",
       subtitle = "Box width proportional to sample size; n shown below each box")

Violin plots

Violin plots mirror a density plot on both sides of a central axis, giving them their characteristic shape. They show the full distribution shape — including multimodality — while remaining compact enough to compare across groups.

Code
ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Time period", y = "Prepositions per 1,000 words",
       title = "Violin plots reveal distribution shape")

Choosing between distribution plot types

Plot type Reveals Best for Avoid when
Histogram Counts in bins Single variable, showing counts Comparing many groups
Density Smooth shape Comparisons, overlapping groups Exact counts needed
Ridge Multiple shapes Many groups (> 4) Fewer than 3 groups
Boxplot Five-number summary + outliers Statistical summaries Distribution shape matters
Violin Shape + summary Detecting multimodality Very small samples





Part 4: Categorical Data

Section Overview

What you will learn: Bar plots in their basic, grouped, stacked, and normalised forms; Likert scale visualisation; and the case against pie charts

Bar plots

Bar plots show counts, frequencies, or summary values for categorical groups. They are the workhorse of categorical data visualisation.

First, we create summary data:

Code
bdat <- pdat |>
  dplyr::mutate(DateRedux = factor(DateRedux)) |>
  group_by(DateRedux) |>
  dplyr::summarise(Frequency = n()) |>
  dplyr::mutate(Percent = round(Frequency / sum(Frequency) * 100, 1))

bdat
# A tibble: 5 × 3
  DateRedux Frequency Percent
  <fct>         <int>   <dbl>
1 1150-1499        34     6.3
2 1500-1599       180    33.5
3 1600-1699       225    41.9
4 1700-1799        53     9.9
5 1800-1913        45     8.4

Basic bar plot

Code
ggplot(bdat, aes(DateRedux, Percent, fill = DateRedux)) +
  geom_bar(stat = "identity") +
  geom_text(aes(y = Percent - 3,
                label = paste0(Percent, "%")),
            color = "white", size = 4) +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Time period",
       y = "Percentage of documents",
       title = "Distribution of texts across time periods")

stat = "identity" explained

geom_bar() defaults to stat = "count", which counts the number of rows per group. When your data already contains the values to plot — as bdat$Percent does here — use stat = "identity" to plot the values as given without any additional aggregation.

Grouped and stacked bar plots

Code
ggplot(pdat, aes(Region, fill = DateRedux)) +
  geom_bar(position = position_dodge(), stat = "count") +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  labs(x = "Region", y = "Number of documents", fill = "Time period",
       title = "Document counts by region and time period (grouped)")

Code
ggplot(pdat, aes(DateRedux, fill = GenreRedux)) +
  geom_bar(stat = "count") +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  labs(x = "Time period", y = "Number of documents", fill = "Genre",
       title = "Genre composition across time periods (stacked)")

Code
ggplot(pdat, aes(DateRedux, fill = GenreRedux)) +
  geom_bar(stat = "count", position = "fill") +
  scale_fill_manual(values = clrs) +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  labs(x = "Time period", y = "Proportion of documents", fill = "Genre",
       title = "Relative genre composition over time (100% stacked)")

Bar type Use when
Basic / grouped Comparing absolute counts across groups
Stacked Showing composition and total simultaneously
100% normalised Only proportions matter, not absolute counts

Likert scale visualisations

Survey data recorded on Likert scales (e.g. Strongly Disagree to Strongly Agree) requires careful visualisation because the response categories are ordered, the neutral midpoint is meaningful, and the visual emphasis should reflect valence.

Code
ldat <- base::readRDS("tutorials/dviz/data/lid.rda", "rb")
head(ldat)
   Course Satisfaction
1 Chinese            1
2 Chinese            1
3 Chinese            1
4 Chinese            1
5 Chinese            1
6 Chinese            1

Grouped bar plot

Code
nlik <- ldat |>
  dplyr::group_by(Course, Satisfaction) |>
  dplyr::summarize(Frequency = n(), .groups = "drop")

ggplot(nlik, aes(Satisfaction, Frequency, fill = Course)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_manual(values = clrs[1:3]) +
  geom_text(aes(label = Frequency),
            vjust = 1.6, color = "white",
            position = position_dodge(0.9), size = 3.5) +
  scale_x_discrete(
    limits = 1:5,
    labels = c("Very\nDissatisfied", "Dissatisfied",
               "Neutral", "Satisfied", "Very\nSatisfied")
  ) +
  theme_bw() +
  labs(title = "Student satisfaction by course",
       x = "Satisfaction level", y = "Number of students")

Cumulative distribution plot

Code
ggplot(ldat, aes(x = Satisfaction, color = Course)) +
  geom_step(aes(y = after_stat(y)), stat = "ecdf", linewidth = 1.5) +
  scale_colour_manual(values = clrs[1:3]) +
  scale_x_discrete(
    limits = 1:5,
    labels = c("Very\nDissatisfied", "Dissatisfied",
               "Neutral", "Satisfied", "Very\nSatisfied")
  ) +
  theme_bw() +
  labs(title = "Cumulative satisfaction distribution",
       y = "Cumulative proportion", x = "Satisfaction level")

Reading cumulative distribution plots

A steeper slope at any point means responses are concentrated in that range. A line that runs high on the left means many dissatisfied respondents. When two lines cross, it means the distributions have different shapes — one group may have more extreme responses in both directions.

gglikert: diverging bar chart

The gglikert() function from the ggstats package creates diverging stacked bar charts that place negative responses on the left and positive responses on the right, with neutral in the middle. This is currently considered the most effective visualisation for Likert data.

Code
sdat <- base::readRDS("tutorials/dviz/data/sdd.rda", "rb")

colnames(sdat)[3:ncol(sdat)] <- paste0(
  "Q", str_pad(1:10, 2, "left", "0"), ": ",
  colnames(sdat)[3:ncol(sdat)]
) |>
  stringr::str_replace_all("\\.", " ") |>
  stringr::str_squish() |>
  stringr::str_replace_all("$", "?")

lbs <- c("Disagree", "Somewhat\nDisagree", "Neutral",
         "Somewhat\nAgree", "Agree")

survey <- sdat |>
  dplyr::mutate_if(is.character, factor) |>
  dplyr::mutate_if(is.numeric, factor, levels = 1:5, labels = lbs) |>
  drop_na() |>
  as.data.frame()

survey |>
  dplyr::select(matches("01|02|03|04")) |>
  gglikert(labels_size = 2.5, add_labels = FALSE) +
  ggtitle("Survey responses: selected questions") +
  scale_fill_brewer(palette = "RdBu")

Likert visualisation best practices
  • Keep response categories in their natural order — never sort by frequency
  • Use a diverging colour palette (e.g. red–blue) centred on the neutral midpoint
  • Show the neutral category separately in the middle of the bar
  • Include sample sizes when comparing groups
  • Prefer diverging bar charts over plain stacked bars for communication

Pie charts: use with caution

The case against pie charts

Human visual perception is much better at comparing lengths (bar plot) than angles or areas (pie chart). Research consistently shows that people make more accurate judgements from bar charts than from pie charts, especially when slices are of similar size or when there are more than three categories.

Pie charts may be acceptable when there are only two or three categories and one clearly dominates. In most other situations, a bar chart communicates more accurately.

Code
piedata <- bdat |>
  dplyr::arrange(desc(DateRedux)) |>
  dplyr::mutate(Position = cumsum(Percent) - 0.5 * Percent)

p_bar <- ggplot(bdat, aes("", Percent, fill = DateRedux)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.7) +
  scale_fill_manual(values = clrs) +
  theme_minimal() +
  labs(title = "Bar plot", y = "Percent", x = "")

p_pie <- ggplot(piedata, aes("", Percent, fill = DateRedux)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = clrs) +
  theme_void() +
  geom_text(aes(y = Position, label = paste0(Percent, "%")),
            color = "white", size = 4) +
  labs(title = "Pie chart")

p_bar + p_pie

Without looking at the percentage labels, try to identify the second-largest category in each plot. The bar plot makes this easy; the pie chart makes it difficult.






Part 5: Advanced Visualisations

Section Overview

What you will learn: Heatmaps and association plots for matrix data; word clouds for text data; flag plots for international comparisons; dot plots with error bars; and diverging bar plots

Heatmaps

Heatmaps use colour intensity to represent values in a two-dimensional matrix. They are effective for showing patterns across many combinations of two categorical variables.

Code
heatdata <- pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Prepositions = mean(Prepositions), .groups = "drop") |>
  tidyr::spread(DateRedux, Prepositions)

heatmx <- as.matrix(heatdata[, 2:5])
rownames(heatmx) <- heatdata$GenreRedux
heatmx_scaled <- scale(heatmx)
Code
heatmap(heatmx_scaled,
        scale  = "none",
        col    = colorRampPalette(c("blue", "white", "red"))(50),
        margins = c(7, 10),
        main   = "Preposition frequency: standardised mean by genre and period")

The dendrograms show which genres (rows) and time periods (columns) cluster together based on their preposition frequency profiles. Blue indicates below-average frequency; red indicates above-average frequency.

Association and mosaic plots

Association plots and mosaic plots from the vcd package visualise the relationship between two categorical variables, showing deviations from statistical independence.

Code
library(vcd)

assocdata <- pdat |>
  dplyr::mutate(
    GenreRedux = dplyr::case_when(
      GenreRedux == "Conversational" ~ "Conv.",
      GenreRedux == "Religious"      ~ "Relig.",
      TRUE ~ GenreRedux
    )
  ) |>
  dplyr::group_by(GenreRedux, DateRedux) |>
  dplyr::summarise(Prepositions = round(mean(Prepositions), 0),
                   .groups = "drop") |>
  tidyr::spread(DateRedux, Prepositions)

assocmx <- as.matrix(assocdata[, 2:6])
rownames(assocmx) <- assocdata$GenreRedux
Code
assoc(assocmx, shade = TRUE,
      main = "Association plot: genre by time period")

Code
mosaic(assocmx, shade = TRUE, legend = TRUE,
       main = "Mosaic plot: genre composition over time")

Interpreting these plots:

  • Bars or tiles above the baseline: more than expected under independence
  • Bars or tiles below the baseline: less than expected
  • Blue shading: significantly more than expected (p < 0.05)
  • Red shading: significantly less than expected (p < 0.05)
  • Bar width in the association plot: contribution to the chi-square statistic

Word clouds

Word clouds represent term frequencies visually, with word size proportional to frequency. They are visually engaging but imprecise — word sizes are difficult to compare accurately. Use them for exploratory purposes or presentations, not as primary evidence in a paper.

Code
library(quanteda)
library(quanteda.textplots)

clinton <- base::readRDS("tutorials/dviz/data/Clinton.rda", "rb") |>
  paste0(collapse = " ")
trump   <- base::readRDS("tutorials/dviz/data/Trump.rda", "rb") |>
  paste0(collapse = " ")

corp_dom <- quanteda::corpus(c(clinton, trump))
attr(corp_dom, "docvars")$Author <- c("Clinton", "Trump")

dfm_dom <- corp_dom |>
  quanteda::tokens(remove_punct = TRUE) |>
  quanteda::tokens_remove(stopwords("english")) |>
  quanteda::dfm() |>
  quanteda::dfm_group(groups = corp_dom$Author) |>
  quanteda::dfm_trim(min_termfreq = 200, verbose = FALSE)
Code
dfm_dom |>
  quanteda.textplots::textplot_wordcloud(
    comparison = TRUE,
    max_words  = 50,
    color      = c("blue", "red")
  )

Country flags in visualisations

The ggflags package allows country flags to be used as data point markers, making international comparisons more immediately readable.

Code
flagsdf <- data.frame(
  Region  = c("Australia", "Canada", "Great Britain", "India",
               "Ireland", "New Zealand", "United States"),
  Percent = c(0.022, 0.017, 0.025, 0.010, 0.019, 0.020, 0.036),
  Kachru  = c("Inner circle", "Inner circle", "Inner circle", "Outer circle",
               "Inner circle", "Inner circle", "Inner circle"),
  country = c("au", "ca", "gb", "in", "ie", "nz", "us")
)
Code
flagsdf |>
  ggplot(aes(x = reorder(Region, Percent),
             y = Percent,
             country = country,
             fill = Kachru)) +
  geom_bar(stat = "identity") +
  ggflags::geom_flag(size = 5) +
  geom_text(aes(label = scales::percent(Percent, accuracy = 0.1)),
            hjust = -0.3, size = 3) +
  coord_flip(ylim = c(0, 0.045)) +
  scale_fill_manual(values = c("lightblue", "coral")) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(x = "", y = "Vulgar language percentage",
       title = "Vulgar language use by English-speaking region",
       fill = "English type") +
  theme(legend.position = c(0.8, 0.3),
        panel.grid.major = element_blank())

Dot plots with error bars

Dot plots showing means with confidence intervals are often preferable to bar plots for continuous outcomes because they avoid the visual distortion caused by showing the mean as the height of a bar that starts at zero.

Code
ggplot(pdat, aes(x = reorder(Genre, Prepositions, mean),
                 y = Prepositions,
                 group = Genre)) +
  stat_summary(fun = mean, geom = "point", size = 4,
               aes(color = Genre)) +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar",
               width = 0.2, linewidth = 1) +
  coord_cartesian(ylim = c(80, 200)) +
  theme_bw(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(x = "", y = "Prepositions per 1,000 words",
       title = "Mean preposition frequency by genre",
       subtitle = "Error bars show 95% bootstrap confidence intervals")

Diverging bar plots

Diverging bar plots show deviation from a reference value, with positive deviations extending in one direction and negative in the other. They are useful for comparing group profiles against a baseline.

Code
Test1 <- c(11.2, 13.5, 200, 185, 1.3, 3.5)
Test2 <- c(12.2, 14.7, 210, 175, 1.9, 3.0)
Test3 <- c(13.2, 15.1, 177, 173, 2.4, 2.9)

testdata <- data.frame(Test1, Test2, Test3)
rownames(testdata) <- c(
  "Feature1_Student", "Feature1_Reference",
  "Feature2_Student", "Feature2_Reference",
  "Feature3_Student", "Feature3_Reference"
)

plottable <- data.frame(
  Test    = rep(rownames(t(testdata[1,] - testdata[2,])), 3),
  Value   = c(t(testdata[1,] - testdata[2,]),
              t(testdata[3,] - testdata[4,]),
              t(testdata[5,] - testdata[6,])),
  Feature = rep(c("Feature A", "Feature B", "Feature C"), each = 3)
)

ggplot(plottable, aes(Test, Value, fill = Test)) +
  facet_grid(vars(Feature), scales = "free_y") +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  scale_fill_manual(values = clrs[1:3]) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Test", y = "Deviation from reference",
       title = "Learner performance relative to native speaker reference",
       subtitle = "Positive = above reference; negative = below reference")


Part 6: Time Series and Line Graphs

Section Overview

What you will learn: Line graphs for discrete and continuous time variables; smoothed trend lines; ribbon plots for displaying uncertainty; and how to choose between these approaches

Basic line graphs

Line graphs connect data points in temporal order, making trends and trajectories visible. The group aesthetic tells ggplot2 which points to connect.

Code
pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Frequency = mean(Prepositions), .groups = "drop") |>
  ggplot(aes(x = DateRedux, y = Frequency,
             group = GenreRedux,
             color = GenreRedux)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_color_manual(values = clrs) +
  theme_minimal() +
  labs(title = "Preposition frequency over time by genre",
       x = "Time period",
       y = "Mean frequency per 1,000 words",
       color = "Genre")

Smoothed line graphs

For continuous time variables with many data points, LOESS smoothing (locally estimated scatterplot smoothing) reveals the underlying trend while absorbing noise from individual observations.

Code
ggplot(pdat, aes(x = Date, y = Prepositions,
                 color = GenreRedux,
                 linetype = GenreRedux)) +
  geom_smooth(se = FALSE, linewidth = 1.2) +
  scale_linetype_manual(
    values = c("solid", "dashed", "dotted", "dotdash", "longdash"),
    name = "Genre"
  ) +
  scale_colour_manual(values = clrs, name = "Genre") +
  theme_bw() +
  theme(legend.position = "top") +
  labs(x = "Year", y = "Relative frequency\nper 1,000 words",
       title = "Smoothed trends in preposition use (LOESS)")

Using both colour and line type (redundant encoding) keeps the lines distinguishable in greyscale and for readers with colour vision deficiency.

Ribbon plots: showing uncertainty

Ribbon plots (geom_ribbon) display ranges or intervals as shaded bands around a central line. They are effective for communicating uncertainty, variability, or the full range of observed values.

Code
pdat |>
  dplyr::mutate(DateRedux = as.numeric(DateRedux)) |>
  dplyr::group_by(DateRedux) |>
  dplyr::summarise(
    Mean = mean(Prepositions),
    Min  = min(Prepositions),
    Max  = max(Prepositions),
    SD   = sd(Prepositions),
    .groups = "drop"
  ) |>
  ggplot(aes(x = DateRedux, y = Mean)) +
  geom_ribbon(aes(ymin = Min, ymax = Max),
              fill = "gray80", alpha = 0.3) +
  geom_ribbon(aes(ymin = Mean - SD, ymax = Mean + SD),
              fill = "lightblue", alpha = 0.4) +
  geom_line(linewidth = 1.2, color = "darkblue") +
  scale_x_continuous(labels = names(table(pdat$DateRedux))) +
  theme_minimal() +
  labs(title = "Preposition frequency: mean with variability",
       subtitle = "Dark blue = mean; light blue = ±1 SD; grey = full range",
       x = "Time period",
       y = "Frequency per 1,000 words")






Part 7: Combining Plots with patchwork

Section Overview

What you will learn: How to combine multiple ggplot2 plots into a single figure using the patchwork package; layout operators; adding shared titles, subtitles, and labels; and when combining plots is appropriate

Why combine plots?

A multi-panel figure is often more effective than a series of separate plots when:

  • You want readers to compare related results side by side
  • A single visualisation cannot show all the relevant aspects of the data
  • You are preparing a figure for a publication that expects one figure file per result

The patchwork package provides a simple and powerful syntax for combining ggplot2 plots.

Basic patchwork syntax

The three main operators are:

  • | — place plots side by side (horizontal)
  • / — place plots one above the other (vertical)
  • + — add to the current layout (follows row-by-row order)
  • () — group plots for nested layouts
Code
# Create three component plots
p1 <- ggplot(pdat, aes(x = DateRedux, y = Prepositions, fill = DateRedux)) +
  geom_boxplot() +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Time period", y = "Prepositions per 1,000 words",
       title = "A: Boxplots")

p2 <- ggplot(pdat, aes(x = Prepositions, y = GenreRedux, fill = GenreRedux)) +
  geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none") +
  labs(x = "Prepositions per 1,000 words", y = "",
       title = "B: Ridge plot")

p3 <- pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Mean = mean(Prepositions), .groups = "drop") |>
  ggplot(aes(x = DateRedux, y = Mean,
             group = GenreRedux, color = GenreRedux)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 2.5) +
  scale_color_manual(values = clrs) +
  theme_minimal() +
  labs(x = "Time period", y = "Mean frequency",
       color = "Genre", title = "C: Line graph")

# Combine: p1 and p2 side by side, with p3 below
(p1 | p2) / p3

Shared labels and annotations

patchwork provides plot_annotation() for adding overall titles, subtitles, and captions, and plot_layout() for controlling spacing and shared legends.

Code
(p1 | p2) / p3 +
  plot_annotation(
    title    = "Preposition frequency in historical English texts",
    subtitle = "Three complementary views of the same dataset",
    caption  = "Source: Penn Parsed Corpora of Historical English",
    tag_levels = "A"
  )

Collecting legends

When multiple plots share the same colour mapping, you can collect the legends into a single shared legend with plot_layout(guides = "collect").

Code
pa <- ggplot(pdat, aes(DateRedux, Prepositions, fill = GenreRedux)) +
  geom_boxplot() +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  labs(x = "Time period", y = "Prepositions", fill = "Genre")

pb <- ggplot(pdat, aes(DateRedux, fill = GenreRedux)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = clrs) +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  labs(x = "Time period", y = "Proportion", fill = "Genre")

pa2 <- pa + theme(legend.position = "bottom")
pb2 <- pb + theme(legend.position = "bottom")

pa2 | pb2






Part 8: Publication-Ready Plots and Choosing Wisely

Section Overview

What you will learn: What makes a plot publication-ready; saving figures in the right format and resolution; colour accessibility; a decision framework for choosing plot types; and the most common visualisation mistakes to avoid

The anatomy of a publication-ready plot

A plot ready for a journal article or conference proceedings should have:

  • A clear, informative title and (where appropriate) a subtitle
  • Axis labels that name the variable and include units
  • A legend that is necessary and clearly positioned
  • A theme appropriate to the publication context (usually theme_bw() or theme_minimal() rather than the default grey background)
  • Font sizes large enough to be legible at the final printed size
  • A colourblind-accessible colour palette
  • A caption noting the data source and what error bars or ribbons represent

Complete example

Code
pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(
    Mean = mean(Prepositions),
    SE   = sd(Prepositions) / sqrt(n()),
    N    = n(),
    .groups = "drop"
  ) |>
  ggplot(aes(x = DateRedux, y = Mean,
             color = GenreRedux, group = GenreRedux)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE),
                width = 0.2, linewidth = 0.8) +
  scale_color_manual(
    name   = "Text genre",
    values = clrs,
    labels = c("Conversational", "Fiction", "Legal", "Non-fiction", "Religious")
  ) +
  scale_y_continuous(breaks = seq(100, 200, 20), limits = c(100, 200)) +
  theme_bw(base_size = 14) +
  theme(
    legend.position       = c(0.15, 0.65),
    legend.background     = element_rect(fill = "white", color = "black"),
    panel.grid.minor      = element_blank(),
    plot.title            = element_text(face = "bold", size = 16),
    plot.subtitle         = element_text(size = 12, color = "gray30"),
    plot.caption          = element_text(size = 10, hjust = 0)
  ) +
  labs(
    title    = "Historical trends in preposition usage",
    subtitle = "Analysis of English texts from 1150 to 1913",
    x        = "Time period",
    y        = "Mean frequency (per 1,000 words)",
    caption  = "Source: Penn Parsed Corpora of Historical English (PPC)\nError bars show ±1 SE"
  )

Saving figures

Code
# For journal submission (300 dpi minimum)
ggsave("preposition_trends.png",  width = 10, height = 6, dpi = 300)

# For vector graphics (no resolution limit — scales to any size)
ggsave("preposition_trends.pdf",  width = 10, height = 6)

# For web use
ggsave("preposition_trends_web.png", width = 10, height = 6, dpi = 150)
Format guide

PNG — raster format; use for web, slides, and figures containing photographs. Specify dpi = 300 for print.

PDF — vector format; use for journal submission where possible. Scales to any size without loss of quality. Best for plots containing text and sharp geometric elements.

TIFF — some journals require TIFF. Use dpi = 600 for posters.

SVG — vector format; useful for web and for figures you may need to edit further in Inkscape or Illustrator.

Colour accessibility

Approximately 8% of men and 0.5% of women have some form of colour vision deficiency. Designing accessible plots benefits all readers, not only those with colour vision differences.

Code
p_problem <- pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Mean = mean(Prepositions), .groups = "drop") |>
  ggplot(aes(DateRedux, Mean, fill = GenreRedux)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("red", "green", "blue", "yellow", "purple")) +
  ggtitle("Problematic colours") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        legend.position = "none")

p_better <- pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Mean = mean(Prepositions), .groups = "drop") |>
  ggplot(aes(DateRedux, Mean, fill = GenreRedux)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis_d() +
  ggtitle("Colourblind-friendly (viridis)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        legend.position = "none")

p_problem | p_better

Colourblind-safe options in ggplot2:

  • scale_color_viridis_d() / scale_fill_viridis_d() — for discrete variables
  • scale_color_viridis_c() / scale_fill_viridis_c() — for continuous variables
  • scale_color_brewer(palette = "Set2") or "Dark2" — ColorBrewer palettes, many colourblind-safe
  • Redundant encoding (colour + shape, or colour + line type) as a complement

Choosing the right plot: a decision framework

By data structure

One continuous variable — show distribution:

  • Small samples (< 50): dot plot, strip plot
  • Medium samples (50–500): histogram, density plot
  • Large samples (500+): density plot, violin plot
  • Summary statistics: boxplot

One continuous + one categorical — compare groups:

  • Distributions: boxplot, violin plot, ridge plot
  • Means with uncertainty: dot plot with error bars
  • Show all data: jittered points

Two continuous variables — show relationship:

  • Basic: scatter plot
  • Overplotting: hex plot, 2D density
  • With trend: add geom_smooth()
  • Groups: colour, shape, or facets

Two categorical variables — show association:

  • Frequencies: grouped or stacked bar plot
  • Proportions: 100% normalised bar, mosaic plot
  • Statistical deviations: association plot

Time series — show change:

  • Discrete time points: line graph with points
  • Continuous time: smoothed line, ribbon plot
  • Multiple series: coloured lines or small multiples

Three or more variables — multivariate:

  • Third variable categorical: colour + facets
  • Third variable continuous: colour gradient or bubble size
  • Many variables: heatmap

Common mistakes to avoid

3D charts — almost never appropriate. They distort values through perspective effects and make precise comparison impossible. Use 2D plots with grouping, colour, or facets instead.

Dual y-axes — can be used to misrepresent relationships between variables by independently scaling each axis. Prefer faceted plots or normalising both variables to the same scale.

Truncated y-axis on bar plots — bar heights encode values by length from zero. Cutting the axis at a non-zero value exaggerates differences. Bar plots must start at zero. Dot plots with error bars can use a truncated axis because they do not encode values by length from a baseline.

Too many colours — more than about six colours becomes difficult to distinguish. Consider reducing categories, using facets, or highlighting one group while greying the rest.

Chartjunk — decorative elements (unnecessary gridlines, 3D shadows, background images, clipart) distract from the data and add no information. Start with theme_minimal() or theme_bw() and add only what is needed.

Sorting bars randomly — unless the categories have a natural order (time periods, scale levels), sort bars by value to make rank comparisons easy.






Final Challenge: Capstone Project

Comprehensive data visualisation project

You have learned all the core techniques. The capstone is to create a coherent data story using the pdat dataset (or your own data).

Required components:

  1. At least three different plot types from different sections — one showing distributions, one showing relationships, and one showing categorical comparisons
  2. Publication-ready quality: proper titles, labels and captions; a colourblind-friendly palette; appropriate themes; clear legends
  3. At least one combined figure using patchwork with a shared annotation
  4. A written narrative: a short introduction explaining your research question; brief transition text between plots explaining what each shows; and a conclusion summarising what the visualisations reveal

Example research questions to explore:

  • How has genre composition changed across the historical periods covered in the corpus?
  • Are there regional differences in preposition frequency, and do they interact with time period?
  • Which genres show the greatest variability in preposition use, and what might this reflect about genre norms?

Suggested deliverables: A fully ggplot2::annotated .qmd document with all code, at least three saved publication-quality figures (PNG, 300 dpi), and a brief 2–3 sentence caption for each figure as it would appear in a paper.


Citation & Session Info

Citation

Martin Schweinberger. 2026. Mastering Data Visualization with R. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/dviz/dviz.html (Version 2026.05.01), doi: .

@manual{martinschweinberger2026mastering,
  author       = {Martin Schweinberger},
  title        = {Mastering Data Visualization with R},
  year         = {2026},
  note         = {https://ladal.edu.au/tutorials/dviz/dviz.html},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia},
  edition      = {2026.05.01}
  doi      = {}
}
Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] patchwork_1.3.0         checkdown_0.0.13        viridis_0.6.5          
 [4] viridisLite_0.4.2       quanteda.textplots_0.95 quanteda_4.2.0         
 [7] scales_1.4.0            ggstats_0.10.0          ggflags_0.0.4          
[10] ggstatsplot_0.13.0      EnvStats_3.0.0          gridExtra_2.3          
[13] vip_0.4.1               PMCMRplus_1.9.12        rstantools_2.4.0       
[16] hexbin_1.28.5           flextable_0.9.11        tidyr_1.3.2            
[19] ggridges_0.5.6          tm_0.7-16               NLP_0.3-2              
[22] vcd_1.4-13              likert_1.3.5            xtable_1.8-4           
[25] ggplot2_4.0.2           stringr_1.6.0           dplyr_1.2.0            

loaded via a namespace (and not attached):
 [1] mnormt_2.1.2            rematch2_2.1.2          sandwich_3.1-1         
 [4] rlang_1.1.7             magrittr_2.0.4          multcomp_1.4-28        
 [7] compiler_4.4.2          statsExpressions_1.6.2  BWStest_0.2.3          
[10] systemfonts_1.3.1       vctrs_0.7.2             reshape2_1.4.5         
[13] kSamples_1.2-10         pkgconfig_2.0.3         fastmap_1.2.0          
[16] labeling_0.4.3          effectsize_1.0.1        rmarkdown_2.30         
[19] markdown_2.0            ragg_1.5.1              purrr_1.2.1            
[22] xfun_0.56               cachem_1.1.0            Rmpfr_1.0-0            
[25] litedown_0.9            jsonlite_2.0.0          SuppDists_1.1-9.8      
[28] gmp_0.7-5               uuid_1.2-1              psych_2.4.12           
[31] stopwords_2.3           parallel_4.4.2          R6_2.6.1               
[34] stringi_1.8.7           RColorBrewer_1.1-3      lmtest_0.9-40          
[37] estimability_1.5.1      Rcpp_1.1.1              iterators_1.0.14       
[40] knitr_1.51              zoo_1.8-13              parameters_0.28.3      
[43] correlation_0.8.6       Matrix_1.7-2            splines_4.4.2          
[46] tidyselect_1.2.1        rstudioapi_0.17.1       yaml_2.3.10            
[49] codetools_0.2-20        lattice_0.22-6          tibble_3.3.1           
[52] plyr_1.8.9              withr_3.0.2             bayestestR_0.17.0      
[55] S7_0.2.1                askpass_1.2.1           coda_0.19-4.1          
[58] evaluate_1.0.5          survival_3.7-0          RcppParallel_5.1.10    
[61] zip_2.3.2               xml2_1.3.6              pillar_1.11.1          
[64] BiocManager_1.30.27     renv_1.1.7              foreach_1.5.2          
[67] insight_1.4.6           generics_0.1.4          paletteer_1.6.0        
[70] commonmark_2.0.0        glue_1.8.0              slam_0.1-55            
[73] gdtools_0.5.0           emmeans_1.10.7          tools_4.4.2            
[76] data.table_1.17.0       mvtnorm_1.3-3           fastmatch_1.1-8        
[79] datawizard_1.3.0        colorspace_2.1-1        nlme_3.1-166           
[82] cli_3.6.5               textshaping_1.0.0       officer_0.7.3          
[85] fontBitstreamVera_0.1.1 gtable_0.3.6            zeallot_0.1.0          
[88] digest_0.6.39           fontquiver_0.2.1        TH.data_1.1-3          
[91] htmlwidgets_1.6.4       farver_2.1.2            memoise_2.0.1          
[94] htmltools_0.5.9         lifecycle_1.0.5         multcompView_0.1-10    
[97] fontLiberation_0.1.0    openssl_2.3.2           MASS_7.3-61            
AI Transparency Statement

This tutorial was re-developed with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to help revise the tutorial text, structure the instructional content, generate the R code examples, and write the checkdown quiz questions and feedback strings. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.

Back to top

Back to LADAL home

Resources and Further Reading

Books

  • Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (2nd ed.). Springer. Free online: ggplot2-book.org
  • Healy, K. (2018). Data Visualization: A Practical Introduction. Princeton University Press. Free online: socviz.co
  • Wilke, C. O. (2019). Fundamentals of Data Visualization. O’Reilly. Free online: clauswilke.com/dataviz

Online tools and references

Practice datasets

Built into R: mpg, diamonds, economics, midwest

From packages: palmerpenguins (palmerpenguins), gapminder (gapminder), nycflights13 (nycflights13)


Quick Reference

Common geoms

Geom Use for
geom_point() Scatter plots, dot plots
geom_line() Line graphs, time series
geom_bar() Bar plots (counts or values)
geom_boxplot() Distribution summaries with outliers
geom_violin() Distribution shapes
geom_histogram() Single variable distribution (counts)
geom_density() Smooth distribution curves
geom_smooth() Trend lines and regression curves
geom_errorbar() Confidence intervals, error bars
geom_ribbon() Ranges, uncertainty bands
geom_tile() Heatmaps (ggplot2 version)
geom_hex() Hex bins for large scatter data
geom_density_2d() 2D concentration contours

Common aesthetics

Aesthetic Controls
x, y Axis position
color / colour Border or line colour
fill Interior fill colour
size Point size or text size
linewidth Line thickness (replaces size for lines)
shape Point shape
alpha Transparency (0 = invisible, 1 = opaque)
linetype Line style (solid, dashed, dotted, etc.)
group Which observations to connect (lines)

Common themes

Theme Character
theme_bw() White background, black borders — good for publication
theme_minimal() Minimal; no background panel
theme_classic() Classic axis lines, no gridlines
theme_void() No axes or gridlines — for maps, etc.
theme_ridges() Optimised for ridge plots

Position adjustments

Position Use for
position_dodge() Side-by-side bars
position_stack() Stacked bars
position_fill() 100% normalised stacked bars
position_jitter() Spread overlapping points
position_identity() Plot values exactly as given